For the development of this project, we'll be following the Analytics workflow seen in class:
Dataset including all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2018 to 2020.
First of all, we will show the definitions of felony, misdemeanor and violation for a better understanding of the project:
Misdemeanor: Offenses lower than felonies and generally those punishable by fine, penalty, forfeiture, or imprisonment other than in a penitentiary. Under federal law, and most state laws, any offense other than a felony is classified as a misdemeanor. Certain states also have various classes of misdemeanors.
Felony: A serious crime, characterized under federal law and many state statutes as any offense punishable by death or imprisonment in excess of one year. Crimes classified as felonies include, among others, treason, arson, murder, rape, robbery, burglary, manslaughter, and kidnapping.
Violation: Under New York Penal Law, an offense (other than a traffic infraction) for which a sentence of imprisonment of more than fifteen days cannot be imposed; that is, a minor offense less serious than a misdemeanor, such as harassment or disorderly conduct.
Borough: Division of New York City.
Column Name: Description - Type
import pandas as pd
import seaborn as sns
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
# Set Pandas to show all the columns
pd.set_option('display.max_columns', None)
data_path = './data/NYPD_Complaint_Data_Current__Year_To_Date_.csv'
original_data = pd.read_csv(data_path, delimiter=',')
original_data.head()
Once the Challenge/Problem definition and Data gathering stages are covered, the next step is the Exploratory Data Analysis and Transformation. In this stage, Visual Analytics plays a key role in exploring and understanding the features and the relationships between them. Thanks to this process, we will have more context to determine the next steps to take according to the Challenge/Problem definition.
Analyze the main characteristics (type of variables, number of records, nulls, etc.) of the variables of the dataset.
original_data.info()
Looking at the definitions of the variables, we decided to remove some of them that are not useful for what we need. They are:
#We decided to remove those variables since they are not useful
data=original_data.drop(['CMPLNT_TO_DT','CMPLNT_TO_TM', 'HADEVELOPT', 'HOUSING_PSA', 'KY_CD', 'LOC_OF_OCCUR_DESC', 'PD_CD', 'RPT_DT', 'X_COORD_CD', 'Y_COORD_CD', 'Lat_Lon', 'New Georeferenced Column', 'JURISDICTION_CODE'], axis = 1)
Our dataset has 306,656 records. At the beginning it consisted of 36 columns, but after deleting the non-useful ones we end up with 24 columns.
data.info()
data.isnull().sum()
As we said before, the dataset now has 306,656 records and 24 variables. Some variables have null values, which can be seen above. The variables with more than 100,000 null values are PARKS_NM, STATION_NAME and TRANSIT_DISTRICT.
Other variables with a smaller number of null values are PREM_TYP_DESC, VIC_AGE_GROUP, PD_DESC and OFNS_DESC, whose null entries we drop next:
#Dropping the entries with null values in the variables previously explained:
data= data[(data['PREM_TYP_DESC'].notnull())]
data= data[(data['VIC_AGE_GROUP'].notnull())]
data= data[(data['PD_DESC'].notnull())]
data= data[(data['OFNS_DESC'].notnull())]
print("Number of registers: ", len(data))
data.isnull().sum()
Now we have 305,766 entries, which only contain null values in variables where it makes sense to keep them. Instead of leaving them as NULL, for the categorical variables whose value we do not know we will set those values to "UNKNOWN". In the case of parks or transit, we keep the NULLs: if a crime did not occur in a park, it is not that the park is unknown, but that there is no park at all (NULL).
data["BORO_NM"]=data["BORO_NM"].astype(object).fillna("UNKNOWN")
data["SUSP_AGE_GROUP"]=data["SUSP_AGE_GROUP"].astype(object).fillna("UNKNOWN")
data["SUSP_RACE"]=data["SUSP_RACE"].astype(object).fillna("UNKNOWN")
data["SUSP_SEX"]=data["SUSP_SEX"].astype(object).fillna("UNKNOWN")
We are going to look at each variable and see if they have the correct data type. If not, we will change them.
data.info()
Converting datetime variables to the correct format and type
#Converting dates
data["CMPLNT_FR_DT"]=pd.to_datetime(data['CMPLNT_FR_DT'], format="%m/%d/%Y", errors = 'coerce')
#data["CMPLNT_FR_DT"]
#We will filter complaints from 2019-09-30 on, since the last record is from 2020-09-30. This way we keep one year of data.
data= data[(data['CMPLNT_FR_DT'] > '2019-09-30')]
#Converting hours (to_datetime attaches a default date to these time-only strings; we will only use the .dt.hour part)
data['CMPLNT_FR_TM']= pd.to_datetime(data['CMPLNT_FR_TM'], errors = 'coerce')
#Keep only those registers with ages in a valid group, since there are some wrong entries.
data= data[(data['VIC_AGE_GROUP']=="18-24") | (data['VIC_AGE_GROUP']=="25-44") | (data['VIC_AGE_GROUP']== "45-64") | (data['VIC_AGE_GROUP']=="65+") | (data['VIC_AGE_GROUP']== "<18") | (data['VIC_AGE_GROUP']== "UNKNOWN")]
data= data[(data['SUSP_AGE_GROUP']=="18-24") | (data['SUSP_AGE_GROUP']=="25-44") | (data['SUSP_AGE_GROUP']== "45-64") | (data['SUSP_AGE_GROUP']=="65+") | (data['SUSP_AGE_GROUP']== "<18") | (data['SUSP_AGE_GROUP']== "UNKNOWN")]
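The chained boolean comparisons above can be written more compactly with `Series.isin`. A minimal sketch on an invented mini-frame (the `-942` entry is a made-up example of a wrong age; the real column holds the same valid groups):

```python
import pandas as pd

# Hypothetical stand-in for `data`; only the column name matches the real dataset
df = pd.DataFrame({"VIC_AGE_GROUP": ["18-24", "25-44", "-942", "65+", "UNKNOWN"]})

valid_groups = ["<18", "18-24", "25-44", "45-64", "65+", "UNKNOWN"]

# One membership test replaces the chain of |-joined equality comparisons
df = df[df["VIC_AGE_GROUP"].isin(valid_groups)]

print(len(df))  # the bogus "-942" entry is gone
```

The same `isin` call works for both VIC_AGE_GROUP and SUSP_AGE_GROUP, since the valid groups are identical.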
Converting categorical variables
# Define the numerical variables
num_variables = [column for column, datatype in data.dtypes.items() if datatype in (np.int64, np.float64)]
date_variables=['CMPLNT_FR_DT', 'CMPLNT_FR_TM']
# Define the categorical ones
categorical_variables = [column for column in data.columns if column not in num_variables+date_variables]
print(categorical_variables)
#Printing the number of unique values of the categorical variables, and transforming them to their corresponding type.
for variable in categorical_variables:
    print(f'Unique values in {variable} : {len(data[variable].unique())}')
    data[variable] = data[variable].astype('category')
We plot the distribution of complaints by weekday to see if there is any interesting information. But we observe that the number of complaints is very similar regardless of the weekday.
plt.title("Number of complaints classified by day of the week")
data['CMPLNT_NUM'].groupby(data["CMPLNT_FR_DT"].dt.weekday).count().plot(kind="bar")
Here we group and plot the complaints by time. We observe that the hours with the fewest reported complaints are from 2 to 7 am, which makes sense since at those hours most people are sleeping. However, around midnight there are also quite a few crimes.
plt.title("Number of complaints distributed by hours")
data['CMPLNT_FR_TM'].groupby(data["CMPLNT_FR_TM"].dt.hour).count().plot(kind="bar")
#Defining a function that will be used several times to plot a countplot of a given variable in a dataframe
def plot_countplot(data, x):
    ax = sns.countplot(x=x, data=data)
    total = data[x].value_counts().sum()
    plt.title("Countplot of " + x)
    #Annotate each bar with its percentage of the total
    for p in ax.patches:
        txt = str(round(p.get_height()/total*100, 2)) + '%'
        txt_x = p.get_x()
        txt_y = p.get_height()
        ax.text(txt_x, txt_y, txt)
    plt.xticks(rotation=90)
    plt.show()
Plotting countplots of the categorical variables:
for variable in categorical_variables:
    if len(data[variable].unique()) < 20:
        plot_countplot(data, variable)
With these plots, we can extract some information about crime in New York on a general scale:
#Defining a function that will be used to plot a variable distinguishing the values of another one.
def plot_by_variable(data, var1, var_hue):
    ax = sns.catplot(x=var1, hue=var_hue, kind="count", palette="cubehelix", data=data, height=4, aspect=2)
    ax.set_xticklabels(rotation=30).set_titles("{col_name} {col_var}")
    return ax
Now we show some plots that will be explained later on in the visualizations, but that are useful now to see what we can expect.
ax=plot_by_variable(data, "VIC_RACE", "VIC_SEX")
ax.set(title='Plot of victims race classified by their sex')
ax=plot_by_variable(data, "VIC_AGE_GROUP", "VIC_SEX")
ax.set(title='Plot of victims age group classified by their sex')
ax=plot_by_variable(data, "BORO_NM", "VIC_SEX")
ax.set(title='Plot of crimes done in boroughs classified by the sex of the victims')
ax=plot_by_variable(data, "BORO_NM", "SUSP_SEX")
ax.set(title='Plot of crimes done in boroughs classified by the sex of the suspects')
ax=plot_by_variable(data, "BORO_NM", "SUSP_RACE")
ax.set(title='Plot of crimes done in boroughs classified by the race of the suspects')
"""10 parks with more crimes in NYC"""
data['PARKS_NM'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('10 most dangerous parks in NYC')
plt.show()
"""10 most common crimes in NYC"""
data['OFNS_DESC'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('10 most common crimes in NYC')
plt.show()
"""10 stations with more crimes in NYC"""
data['STATION_NAME'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('10 stations with more crimes in NYC')
plt.show()
Let's look at the OFNS_DESC value "DANGEROUS DRUGS" to see which are the most common drug-related offenses.
data[data["OFNS_DESC"]=="DANGEROUS DRUGS"].head()
total = len(data[data["OFNS_DESC"]=="DANGEROUS DRUGS"]["PD_DESC"])
for desc in data[data["OFNS_DESC"]=="DANGEROUS DRUGS"]["PD_DESC"].unique():
    summ = (data[data["OFNS_DESC"]=="DANGEROUS DRUGS"]["PD_DESC"]==desc).sum()
    percent = round(summ/total*100, 2)
    print("For ", desc, ":", percent, "%")
#We see that controlled substance possession is the most common drug offense.
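The loop above can be condensed with `value_counts(normalize=True)`, which computes the per-category share in one call. A sketch on invented values (the PD_DESC strings here are only illustrative):

```python
import pandas as pd

# Toy stand-in for the PD_DESC column of the DANGEROUS DRUGS subset
pd_desc = pd.Series([
    "CONTROLLED SUBSTANCE, POSSESSION 7",
    "CONTROLLED SUBSTANCE, POSSESSION 7",
    "MARIJUANA, POSSESSION 4 & 5",
    "CONTROLLED SUBSTANCE, SALE 3",
])

# normalize=True returns fractions; multiply by 100 for percentages
percentages = (pd_desc.value_counts(normalize=True) * 100).round(2)
print(percentages)
```

On the real data, `data.loc[data["OFNS_DESC"]=="DANGEROUS DRUGS", "PD_DESC"].value_counts(normalize=True)` would replace the manual sum/total loop.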
In the following plot we see the days with the most offenses reported. The two days with the most complaints are the first of June and the first day of the year. The latter makes sense, since on New Year's Eve a lot of arguments tend to break out. Later on we will see a map of the distribution of crimes on that specific day, showing how crimes evolve through the hours.
bydays=data['CMPLNT_NUM'].groupby(data["CMPLNT_FR_DT"]).count()
bydays
bydays.sort_values(ascending=False)[:10].plot(kind="bar")
We want to see if there is any type of crime that is more frequent in the early-morning hours than during the day, or if they follow the same distributions.
#We distribute the dataset in two dataframes, one for the night (2 to 7 am) and another for the day (the rest of hours)
early_morning_df= data[(data["CMPLNT_FR_TM"].dt.hour >= 2)& (data["CMPLNT_FR_TM"].dt.hour <=7)]
during_day_df=data[(data["CMPLNT_FR_TM"].dt.hour < 2)| (data["CMPLNT_FR_TM"].dt.hour >7)]
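The two-sided hour masks above can also be expressed with `Series.dt.hour.between`, which is inclusive on both ends and so matches the `>= 2` and `<= 7` conditions. A sketch on invented timestamps:

```python
import pandas as pd

# Invented time strings standing in for CMPLNT_FR_TM
times = pd.to_datetime(pd.Series(["03:15", "14:40", "07:00", "23:59"]), format="%H:%M")

# between(2, 7) is inclusive, so 2 am and 7 am both count as early morning
night_mask = times.dt.hour.between(2, 7)
print(night_mask.tolist())  # [True, False, True, False]
```

With the real dataframe, `data["CMPLNT_FR_TM"].dt.hour.between(2, 7)` and its negation would produce the same two subsets as the masks above.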
In the following plot, we will show the most common offenses at night compared to the most common offenses during the day.
fig,axes=plt.subplots(1,2, figsize=(15, 5))
ofenses_night=early_morning_df['OFNS_DESC'].groupby(early_morning_df["OFNS_DESC"]).count()
ofenses_night.sort_values(ascending=False)[:10].plot(kind="bar",ax=axes[0])
axes[0].title.set_text('Most common offenses at night')
ofenses_day=during_day_df['OFNS_DESC'].groupby(during_day_df["OFNS_DESC"]).count()
ofenses_day.sort_values(ascending=False)[:10].plot(kind="bar",ax=axes[1])
axes[1].title.set_text('Most common offenses during the day')
plt.show()
We can see that the most common offense is petit larceny, both at night and during the day. However, at night harassment is slightly less frequent and robberies are more common. So the types of offenses that occur during the day and the night are very similar, but some small differences can be seen.
In the next plot we will see again the difference between day and night, but this time using the variable PD_DESC, which gives a more detailed description of the crime.
fig,axes=plt.subplots(1,2, figsize=(15, 5))
ofenses_night=early_morning_df['PD_DESC'].groupby(early_morning_df["PD_DESC"]).count()
ofenses_night.sort_values(ascending=False)[:10].plot(kind="bar", ax=axes[0])
axes[0].title.set_text('Most common offenses at night (more descriptive)')
ofenses_day=during_day_df['PD_DESC'].groupby(during_day_df["PD_DESC"]).count()
ofenses_day.sort_values(ascending=False)[:10].plot(kind="bar", ax=axes[1])
axes[1].title.set_text('Most common offenses during the day (more descriptive)')
plt.show()
Again, we see that most offenses are similarly distributed between day and night. Even so, we can notice that larcenies from stores are more common during the day, while criminal mischief is more common at night.
Now we will create a dataset containing only the crimes classified as violations. In this first plot, we see the most common places where harassment takes place. We notice that it is more frequent in homes than on the street, meaning that a large share of these offenses are probably committed by relatives or roommates.
VIOLATIONS=data[data["LAW_CAT_CD"]=="VIOLATION"]
VIOLATIONS['PREM_TYP_DESC'].value_counts()[:10].plot.bar()
plt.ylabel('Count')
plt.title('Most common places of harassment')
plt.show()
The next plot shows the distribution of sexes for victims and suspects of violation crimes. Comparing them with the general plots, we see that in violations most of the victims are female (63% female vs 35% male), while in crimes overall the split was almost even (40% female and 37% male). The opposite occurs with the suspects: in the general distribution 45% of crimes were committed by males, whereas in violation crimes they are responsible for 56% of the offenses. From this information it can be extracted that violations tend to be committed by men against women.
#Compare with the general ones
plot_countplot(VIOLATIONS, 'VIC_SEX')
plot_countplot(VIOLATIONS, 'SUSP_SEX')
We will perform a clustering based on the profiles of victims and suspects and the level of offense, so we create a dataframe with the needed variables:
data_clustering=data[["LAW_CAT_CD", "SUSP_AGE_GROUP", "SUSP_RACE", "SUSP_SEX", "VIC_AGE_GROUP", "VIC_RACE", "VIC_SEX"]]
#We divide the data_clustering in two dataframes: one with the unknown values and the other one with the known ones, since we
#will perform the clustering over the profiles that have been properly identified. Later on, we will try to predict some
#characteristics of suspects that have not been identified.
data_cluster_topredict=data_clustering[(data_clustering["VIC_RACE"]=="UNKNOWN") | (data_clustering["VIC_AGE_GROUP"]=="UNKNOWN") | (data_clustering["VIC_SEX"]=="UNKNOWN")
| (data_clustering["SUSP_RACE"]=="UNKNOWN") | (data_clustering["SUSP_AGE_GROUP"]=="UNKNOWN") | (data_clustering["SUSP_SEX"]=="UNKNOWN")
| (data_clustering["SUSP_SEX"]=="U")]
data_cluster = data_clustering.drop(index=data_cluster_topredict.index)
data_cluster.head()
First of all, we are going to encode the different values to numbers, so that we can perform the clustering.
!pip install category_encoders
import category_encoders as ce
# Create the OrdinalEncoder object
encoder= ce.OrdinalEncoder(cols=["LAW_CAT_CD", "SUSP_AGE_GROUP", "SUSP_RACE", "SUSP_SEX", 'VIC_SEX', 'VIC_AGE_GROUP', 'VIC_RACE'],return_df=True,
mapping=[{'col': "LAW_CAT_CD", 'mapping':{"FELONY": 1, "MISDEMEANOR": 0}},#VIOLATION IS -1
{'col':'SUSP_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}},
{'col': "SUSP_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
{'col': "SUSP_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
"WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},
{'col':'VIC_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}},
{'col': "VIC_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
{'col': "VIC_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
"WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},
])
#Transformed data
df_train_transformed = encoder.fit_transform(data_cluster)
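The same ordinal encoding can be reproduced with plain pandas `Series.map`, without the extra category_encoders dependency. A minimal sketch of two of the mappings above, on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical mini-frame; column names match the real dataset, values are invented
df = pd.DataFrame({"SUSP_SEX": ["M", "F"], "SUSP_AGE_GROUP": ["18-24", "65+"]})

# Same numeric codes as the OrdinalEncoder mapping used in the notebook
mappings = {
    "SUSP_SEX": {"M": -1, "F": 1, "E": 0, "D": 0},
    "SUSP_AGE_GROUP": {"<18": 14, "18-24": 22, "25-44": 35, "45-64": 55, "65+": 75},
}

for col, mapping in mappings.items():
    df[col] = df[col].map(mapping)

print(df)
```

One difference to be aware of: `Series.map` leaves unmapped values as NaN, whereas category_encoders handles unknowns according to its own settings.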
#We will plot the distributions of the variables now that the dataframe only contains identified people, and with the
#variables already encoded. This step is just to check everything works as expected.
for var in df_train_transformed:
    plot_countplot(df_train_transformed, var)
#Choosing the optimal number of clusters based on the elbow method
sse = {}
for num_clusters in list(range(1, 8)):
    kmeans = KMeans(n_clusters=num_clusters, random_state=333)
    kmeans.fit(df_train_transformed)
    sse[num_clusters] = kmeans.inertia_
plt.title("Elbow criterion method chart")
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
print("With the elbow method we see that the optimal number of clusters for this dataset is 3.")
#After identifying the optimal number of clusters, we will classify the complaints in three clusters, using the variables
#of suspects and victims, plus the type of crime.
kmeans=KMeans(n_clusters=3)
kmeans.fit(df_train_transformed)
clustering=kmeans.labels_
clustered_data = pd.DataFrame(df_train_transformed)
#we will add a column in the datasets corresponding to the computed cluster
clustered_data["Cluster"]=clustering
data_cluster["Cluster"]=clustering
Now we export the dataframe with the computed clusters to do some visualizations with Tableau in order to give some useful information.
#aran_outcomes = '/content/drive/MyDrive/Visual Analytics/VISUAL ANALYTICS NURIA & ARAN/data_clustered.csv'
#nuria_outcomes = "/content/drive/My Drive/VISUAL ANALYTICS NURIA & ARAN/data_clustered.csv"
data_cluster.to_csv('data_clustered.csv', sep=';')
data_cluster.head()
The three following cells show the description of the three clusters, with the variables encoded as numbers. This is not very useful in itself, since with Tableau we will extract more interesting visualizations. But as a small hint, we can detect that in cluster 0 the suspects' ages are all around 31 years, and the victims' also around 30. In contrast, cluster 1 contains suspects all between 55 and 75 years, and victims in the same range. Finally, cluster 2 contains suspects between 14 and 35 years, and victims from 55 to 75 years old. So the clustering has split the offenses into three kinds: crimes from adults against adults, crimes from elderly people against elderly people, and crimes from young people against elderly ones.
clustered_data[clustered_data["Cluster"]==0].describe()
clustered_data[clustered_data["Cluster"]==1].describe()
clustered_data[clustered_data["Cluster"]==2].describe()
Now we will try to predict the race of suspects that have not been identified.
#Add BORO_NM column to data_clustering dataframe, since it is a variable that we will use to train the model
boros=data.loc[:, "BORO_NM"]
boros
data_clustering["BORO_NM"]=boros
for value in data_clustering["SUSP_RACE"].unique():
    print(value, ": ", len(data_clustering[data_clustering["SUSP_RACE"]==value]))
#We encode the variables again since now there is a new attribute: the borough
encoder_with_boro= ce.OrdinalEncoder(cols=["BORO_NM", "LAW_CAT_CD", "SUSP_AGE_GROUP", "SUSP_RACE", "SUSP_SEX", 'VIC_SEX', 'VIC_AGE_GROUP', 'VIC_RACE'],return_df=True,
mapping=[{'col': "BORO_NM", 'mapping':{"BRONX": 2, "BROOKLYN": 1, "MANHATTAN": 0, "QUEENS": -2}},#STATEN ISLAND IS -1
{'col': "LAW_CAT_CD", 'mapping':{"FELONY": 1, "MISDEMEANOR": 0}},#VIOLATION IS -1
{'col':'SUSP_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}},
{'col': "SUSP_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
{'col': "SUSP_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
"WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},
{'col':'VIC_SEX', 'mapping':{'M':-1,'F':1, 'E': 0, 'D': 0}},
{'col': "VIC_AGE_GROUP", 'mapping':{'<18': 14, '18-24': 22, '25-44': 35, '45-64': 55, '65+': 75}},
{'col': "VIC_RACE", 'mapping': {"AMERICAN INDIAN/ALASKAN NATIVE": -10, "ASIAN / PACIFIC ISLANDER": -6, "WHITE": -2,
"WHITE HISPANIC": 2, "BLACK HISPANIC": 6, "BLACK": 10}},
])
#We divide the dataframe in two: one with the unknown values of suspect races and the other one with the known ones
predict_susp_race=data_clustering[(data_clustering["SUSP_RACE"]=="UNKNOWN")]
predict_susp_race = encoder_with_boro.fit_transform(predict_susp_race)
train_susp_race = data_clustering.drop(index=predict_susp_race.index)
train_susp_race = encoder_with_boro.fit_transform(train_susp_race)
print("We have ", len(train_susp_race), "suspects with their race identified, and ", len(predict_susp_race), "whose race we don't know.")
print("We wanted to predict the race for the unknown ones, however we see that the amount of data we know is almost the same that the data we do not know.")
print("Therefore, it will be very difficult to get a good model that predicts correctly the race, also taking into account that all variables have been endoded.")
print("However, we will try to do these predictions since in any case it is more useful to know the race of the suspect with a 60% of probability that have no idea at all, so this information can be useful for the police, even though it is not very precise.")
#y: target variable, the race of the suspect.
#X: the rest of variables
y=train_susp_race["SUSP_RACE"]
X=train_susp_race.drop(["SUSP_RACE"], axis = 1)
y_topredict=predict_susp_race["SUSP_RACE"]
X_topredict=predict_susp_race.drop(["SUSP_RACE"], axis = 1)
#We split the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
To do the predictions, we will try several different classifiers and see which one performs best. Then, we will use the one with the best accuracy to predict the races of the suspects.
from sklearn.ensemble import GradientBoostingClassifier
# Fit and evaluate models
gb = GradientBoostingClassifier(random_state = 1)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_topredict)
gb.score(X_test,y_test)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state = 1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_topredict)
rf.score(X_test,y_test)
from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier(random_state = 1)
ab.fit(X_train, y_train)
y_pred_ab = ab.predict(X_topredict)
ab.score(X_test,y_test)
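The three nearly identical cells above can be folded into a single comparison loop that fits every candidate and keeps the best test score. A sketch on synthetic data (the dataset here is generated with `make_classification`, not the complaints data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic multi-class stand-in for the suspect-race task
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=1)

# Fit each candidate and record its held-out accuracy
models = {
    "GradientBoosting": GradientBoostingClassifier(random_state=1),
    "RandomForest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}

best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```

With the real `X_train`/`y_train` from above, the same loop would pick the winning classifier automatically instead of comparing the three scores by eye.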
#Since the best accuracy was obtained with GradientBoostingClassifier, we will use those predictions
predict_susp_race["SUSP_RACE"]=y_pred_gb
#Here we have the encoded dataset with the predicted races of the suspects, which should be mapped back to
#categories to be usable by the police. However, we will leave that to them.
predict_susp_race
For a better understanding of our data, and to take advantage of the spatial variables available, part of our project consists of the spatial analysis of the characteristics of the complaints in the city of New York.
First of all, we install and import some requirements from the .py files in the folder, which will help us work with geodata.
from requirements import *
from ppca import PPCA
from utils import *
For this section, we will get back the cleaned data used above. For a quick recall:
print(data.shape)
data.head(3)
In this next step, we treat the Longitude and Latitude variables as geo variables, in order to get the exact location of each of the complaints recorded in our dataset.
from geopandas import points_from_xy
#take the Longitude and Latitude columns to transform the pandas DataFrame into a GeoDataFrame
gdf = GeoDataFrame(data, geometry=points_from_xy(data['Longitude'], data['Latitude']))
#The GeoDataframe can be spatially plotted
Map(Layer(gdf, encode_data=False))
As previously anticipated, we now take only those crimes committed on the first day of the year and see how they evolve through the hours using an animated map. There is a timeline where the hours can be selected, or it can simply be played. We classify the crimes into 3 colors depending on their jurisdiction (police, housing or transit). We selected this day because it is interesting to see the distribution of crimes on New Year's Day in a more visual and animated way. You will see that on that specific day, instead of occurring mostly during daytime (we saw before that crimes in general are more frequent by day than at night), the majority of the cases occur during the early morning hours, probably because it is a conflictive night in which people party a lot, which generates several complaints. We also see that most offenses correspond to the Police department and very few belong to the Transit one, which also makes sense.
data_time=data.copy()
data_time["hours"]=data_time["CMPLNT_FR_TM"].dt.hour
data_time.sort_values(by = ['hours'], inplace=True)
data_oneday=data_time[(data_time['CMPLNT_FR_DT'] == '2020-01-01')]
data_oneday["size"] = 1
px.set_mapbox_access_token("pk.eyJ1IjoibnVyaWF2IiwiYSI6ImNraWx6ZGRmNjBvYmYyeXBrazFsMjlzOHoifQ.2snfdHzFakwXOjjL51pMtg")
fig = px.scatter_mapbox(data_oneday, lat="Latitude", lon="Longitude",
color="JURIS_DESC", size="size", animation_frame="hours",
color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=9, center={'lat': 40.73692, 'lon': -73.9519})
fig.show()
data_oneday = data_oneday.drop('size', axis = 1)
In a similar way, we have decided to create another animated map, but this time the timeline runs over days instead of hours. For faster rendering, we filtered the data to keep only felony-related complaints. We appreciate that the number of registered complaints increases heavily in 2020, making that year by far more informative than the previous one.
#Animated map of felony complaints per date
Map(
Layer(
gdf[gdf.LAW_CAT_CD == 'FELONY'],
style=animation_style('CMPLNT_FR_DT', color='blue', size=5, duration=40, fade_in=0.5, fade_out=0.5),
widgets=animation_widget(title='Date'), encode_data=False
)
)
To get more information from the location of each of the complaints, we will import a geojson with the boundaries of the different Neighborhood Tabulation Areas (NTA) within the city of New York. Following this step, we will be able to study the data with spatial constraints, i.e. filter the complaints by NTA or Borough (District).
data_path = './data/NYAreas.geojson'
#geojson with the boundaries of the Neighborhood Tabulation Areas of the city of NY
ny_nta = gpd.read_file(data_path)
ny_nta.head()
We check that the datatypes of all the columns are correct; county_fips and boro_code need to be changed to int types, which we do below.
ny_nta['county_fips'] = ny_nta['county_fips'].astype('int64')
ny_nta['boro_code'] = ny_nta['boro_code'].astype('int64')
ny_nta.info()
Just as before, we can plot the geojson file into a map and see the boundaries of each of the NTAs of New York.
plot_size = (920, 400)
basicstyle = cartoframes.viz.basic_style(color = '#eaff00', stroke_color = '#eaff00', stroke_width = 2, opacity = 0.2)
Map(Layer(ny_nta, basicstyle, encode_data=False), size = plot_size,show_info=True)
Once both the complaints dataset and the geojson with the different NTAs are prepared and ready, we start with the key step of the spatial analysis: joining both datasets on the location of the complaints, thus attaching to each complaint the NTA in which it is situated. The geopandas function sjoin provides exactly what we need.
complete_gpd = gpd.sjoin(left_df=gdf, right_df=ny_nta, how='inner')
complete_gpd.head(5)
complete_gpd.info(memory_usage=False)
We can now filter the complaints by Boroughs and also by smaller areas, the NTAs. As an example, here we filter those complaints that are located only in West Farms-Bronx River, Bronx.
Map(Layer(complete_gpd[complete_gpd.ntaname == 'West Farms-Bronx River']))
As we have done previously in the project, we will use the external platform Tableau to visually represent our spatial data and thus better approach our project. The dashboard can be seen by clicking on the following link: https://public.tableau.com/profile/aran3436#!/vizhome/NewYorkComplaints/Dashboard1?publish=yes
complete_gpd.to_csv('Compaints_GeoDF.csv', sep=';')
After doing the visualizations, we thought it would be nice to add a map with the parks and another one with the stations to further support the analysis and conclusions drawn from the dashboards. The data, filtered to those complaints recorded in public parks or stations, is colored according to the district. We appreciate that, as stated in the report and as can be observed visually in Tableau, Manhattan is clearly the district with the most complaints in its public places.
#Map with complaints recorded on Stations
viewport = {'zoom': 9.8, 'lat':40.739261, 'lng': -73.862240}
color_bins_st = color_bins_style('boro_name', bins=len(complete_gpd['boro_name'].unique()), breaks=complete_gpd['boro_name'].unique())
color_bins_leg = color_bins_legend(title='Public Stations', description='Different Districts', footer='Color of the blob')
Map([Layer(complete_gpd[(complete_gpd['STATION_NAME'].notnull())], color_bins_st, color_bins_leg
)], show_info=True, viewport=viewport)
#Map with complaints recorded on Parks
viewport = {'zoom': 9.8, 'lat':40.739261, 'lng': -73.862240}
color_bins_st = color_bins_style('boro_name', bins=len(complete_gpd['boro_name'].unique()), breaks=complete_gpd['boro_name'].unique())
color_bins_leg = color_bins_legend(title='Public Parks', description='Different Districts', footer='Color of the blob')
Map([Layer(complete_gpd[(complete_gpd['PARKS_NM'].notnull())], color_bins_st, color_bins_leg
)], show_info=True, viewport=viewport)
Now we will change the perspective of the project a little and the way we look at our data. Instead of having a dataset with information about each complaint, we will gather the different instances and group them by location; in other words, we will study the characteristics of each NTA with respect to the complaints located within it.
This way we will be able to study and analyse the situation in each of the different NTAs of New York.
To do that, we have to encode the categorical variables (those that are not binary) with one-hot encoding. Using the get_dummies() function, we will be able to approach this correctly.
#we will only take the categorical columns we are interested in.
interesting_cols = ['LAW_CAT_CD', 'OFNS_DESC']
suspect_cols = ['SUSP_AGE_GROUP', 'SUSP_RACE', 'SUSP_SEX']
vic_cols = ['VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX']
needed_cols = ['ntaname', 'CMPLNT_NUM']
nta_gdf = complete_gpd[interesting_cols+suspect_cols+vic_cols+needed_cols]
print(nta_gdf.shape)
nta_gdf.info()
The total number of resulting columns will be the sum of the numbers of unique values of the current variables.
for var in interesting_cols+suspect_cols+vic_cols:
    print(f'Variable {var}: {len(complete_gpd[var].unique())} \n')
As explained, we will group the data by NTA name; this splits each categorical column into its unique values, which is why we need numerical values instead, so that we can sum them up altogether. This can be done with the get_dummies function.
for col in interesting_cols+suspect_cols+vic_cols:
    nta_gdf = pd.get_dummies(nta_gdf, columns = [col])
nta_gdf
Since the next step is grouping the instances by common ntaname and summing up all the column values, we replace the complaint id numbers with 1, since we are only interested in the total count, not the specific id number.
nta_gdf['CMPLNT_NUM'] = 1
We now group the instances by ntaname and sum up all the values of their different columns.
#After grouping the instances by their ntaname, we now have a dataframe with
#the rows being the NTAs of New York
nta_gdf = nta_gdf.groupby(['ntaname']).sum()
nta_gdf.head(3)
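On a toy frame, the dummies-then-sum pattern used above looks like this (the NTA names and category values here are invented, only the pattern matches):

```python
import pandas as pd

# Invented mini-frame: two NTAs, one categorical column, and a count column of 1s
toy = pd.DataFrame({
    "ntaname": ["Astoria", "Astoria", "Harlem"],
    "LAW_CAT_CD": ["FELONY", "MISDEMEANOR", "FELONY"],
    "CMPLNT_NUM": [1, 1, 1],
})

# get_dummies splits the category into 0/1 indicator columns; the groupby-sum
# then turns the indicators into per-NTA counts
toy = pd.get_dummies(toy, columns=["LAW_CAT_CD"])
counts = toy.groupby("ntaname").sum()
print(counts)
```

Each resulting row holds the total number of complaints per NTA (CMPLNT_NUM) plus a breakdown by category, which is exactly the shape of the nta_gdf frame above.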
Again, we are interested in the spatial location of our data, so using the geojson data, we merge that information with the index of our current dataset, the NTA names.
#As before, we add the spatial variables to the dataframe, so we can map-plot our data.
nta_gdf = pd.merge(nta_gdf, ny_nta, on='ntaname')
nta_gdf.head(3)
Before creating some maps, we thought it was good to show some statistics. Staten Island is the district with the fewest complaints, and the difference is quite noticeable. We also appreciate that Queens and the Bronx have similar overall complaint counts, but when it comes to the mean, the latter is considerably higher. Similarly, although Brooklyn concentrates a big part of the complaints within its area, due to its large number of neighborhoods the mean is reduced to a value similar to the Bronx's, making Manhattan the district with the most complaints per neighborhood by a large margin.
stat_isl = nta_gdf[nta_gdf.boro_name == 'Staten Island']['CMPLNT_NUM']
queens = nta_gdf[nta_gdf.boro_name == 'Queens']['CMPLNT_NUM']
bronx = nta_gdf[nta_gdf.boro_name == 'Bronx']['CMPLNT_NUM']
brooklyn = nta_gdf[nta_gdf.boro_name == 'Brooklyn']['CMPLNT_NUM']
manhattan = nta_gdf[nta_gdf.boro_name == 'Manhattan']['CMPLNT_NUM']
print(f' Staten Island. Total Complaints: {stat_isl.sum()}. Mean: {round(stat_isl.mean(),2)}')
print(f' Queens. Total Complaints: {queens.sum()}. Mean: {round(queens.mean(),2)}')
print(f' Bronx. Total Complaints: {bronx.sum()}. Mean: {round(bronx.mean(),2)}')
print(f' Brooklyn. Total Complaints: {brooklyn.sum()}. Mean: {round(brooklyn.mean(),2)}')
print(f' Manhattan. Total Complaints: {manhattan.sum()}. Mean: {round(manhattan.mean(),2)}')
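The same per-borough totals and means can be computed in one step with groupby/agg; a minimal sketch on toy data (borough assignments and counts are illustrative, with one row per neighborhood as in nta_gdf):

```python
import pandas as pd

# Toy stand-in for nta_gdf: one row per neighborhood with its borough and complaint count
toy = pd.DataFrame({
    'boro_name': ['Bronx', 'Bronx', 'Queens', 'Queens', 'Queens'],
    'CMPLNT_NUM': [100, 200, 90, 110, 100],
})

# One aggregation replaces the five filter-and-print blocks above
stats = toy.groupby('boro_name')['CMPLNT_NUM'].agg(['sum', 'mean']).round(2)
print(stats)
```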
Bearing in mind the statistics above, it is now time to plot some maps and inspect the data visually. This total-complaints map is the culmination of this final part of the project, differentiating the areas of New York by the spatial distribution of the complaints.
# Neighborhood Tabulation Areas colored by total complaints
Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('CMPLNT_NUM'), title='Total Complaints',
widgets = histogram_widget('CMPLNT_NUM', title = 'Histogram of Total Complaints'))], show_info=True)
To finish, we put everything together to create maps in which the count of a given offense type colors the background, while a filtered sample of the matching complaints is drawn as points on top.
In this first map, we color each neighborhood according to its number of Dangerous Drugs offense complaints, which we observe are most common in Manhattan and the Bronx, especially at their intersection, and in the center of Brooklyn. The points represent a sample of this crime committed by suspects aged 18 to 24, coloured by race.
# Neighborhood Tabulation Areas painted by count of Dangerous Drugs offenses. Points being suspects coloured by race.
viewport = {'zoom': 8.7, 'lat':40.690433, 'lng': -74.148701}
breaks = ['WHITE HISPANIC', 'WHITE', 'BLACK HISPANIC', 'BLACK']
color_bins_st = color_bins_style('SUSP_RACE', bins=len(breaks), breaks=breaks, palette='mint')
color_bins_leg = color_bins_legend(title='Suspicious Race', description='Different Races', footer='Color of the blob')
Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('OFNS_DESC_DANGEROUS DRUGS', palette='pinkyl'),
title='Dangerous Drugs Offense',
widgets = histogram_widget('OFNS_DESC_DANGEROUS DRUGS', title = 'Histogram of Dangerous Drugs')),
Layer(complete_gpd[(complete_gpd.OFNS_DESC == 'DANGEROUS DRUGS') & (complete_gpd.SUSP_AGE_GROUP == '18-24')
& ((complete_gpd.SUSP_RACE == 'WHITE HISPANIC') | (complete_gpd.SUSP_RACE == 'WHITE') | (complete_gpd.SUSP_RACE == 'BLACK') | (complete_gpd.SUSP_RACE == 'BLACK HISPANIC') )],
color_bins_st, color_bins_leg)], show_info=True, viewport=viewport)
In this second map, similarly to before, we color each neighborhood according to its number of Harassment 2 offense complaints, which we observe are somewhat more spread out: red areas appear in Manhattan, the Bronx, and Brooklyn, with some in Queens too. The points represent a sample of this crime committed by male suspects against underage female victims.
# Neighborhood Tabulation Areas painted by count of Harassment 2 offenses. Points being Male suspects that acted on underaged Female victims
viewport = {'zoom': 8.7, 'lat':40.690433, 'lng': -74.148701}
breaks = ['<18']
color_bins_st = color_bins_style('VIC_AGE_GROUP', bins=len(breaks), breaks=breaks, palette='mint')
color_bins_leg = color_bins_legend(title='Victim Age Group', description='Different Age Ranges', footer='Color of the blob')
Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('OFNS_DESC_HARRASSMENT 2', palette='pinkyl'),
title='Harassment 2 Offense',
widgets = histogram_widget('OFNS_DESC_HARRASSMENT 2', title = 'Histogram of Harassment 2')),
Layer(complete_gpd[(complete_gpd.OFNS_DESC == 'HARRASSMENT 2') & (complete_gpd.VIC_SEX == 'F')
& (complete_gpd.SUSP_SEX == 'M') & (complete_gpd.VIC_AGE_GROUP == '<18')],
color_bins_st, color_bins_leg)], show_info=True, viewport=viewport)
Finally, in this last map, we color each neighborhood according to its number of Dangerous Weapons offense complaints, which we observe are concentrated in the Bronx and at its intersection with Manhattan, though we find some red areas in the center of Brooklyn and in Queens too. The points represent a sample of this crime in which the victim was aged 25 to 44, coloured according to the suspect's sex.
# Neighborhood Tabulation Areas painted by count of DANGEROUS WEAPONS offenses. Points being suspects colored by sex
viewport = {'zoom': 8.7, 'lat':40.690433, 'lng': -74.148701}
breaks = ['F', 'M']
color_bins_st = color_bins_style('SUSP_SEX', bins=len(breaks), breaks=breaks, palette='mint')
color_bins_leg = color_bins_legend(title='Suspect Sex', description='Different Sex', footer='Color of the blob')
Map([Layer(nta_gdf,geom_col = 'geometry', style=color_bins_style('OFNS_DESC_DANGEROUS WEAPONS', palette='pinkyl'),
title='Dangerous Weapons Offense',
widgets = histogram_widget('OFNS_DESC_DANGEROUS WEAPONS', title = 'Histogram of Dangerous Weapons')),
Layer(complete_gpd[(complete_gpd.OFNS_DESC == 'DANGEROUS WEAPONS') & ((complete_gpd.SUSP_SEX == 'F')
| (complete_gpd.SUSP_SEX == 'M')) & (complete_gpd.VIC_AGE_GROUP == '25-44')],
color_bins_st, color_bins_leg)], show_info=True, viewport=viewport)
As a brief recap of these last maps: the neighborhoods at the intersection of Manhattan and the Bronx are usually painted red, meaning they top the complaint counts, followed by the central part of Brooklyn, which also tends to be painted in the darkest color of the scale, and then by the central area of Queens. Staten Island, by contrast, usually records very few complaints.
In the third section of our report we go deeper into our analysis and conclusions.